Getting Started with Entrez Direct

2019-04-11

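Entrez Direct (EDirect) provides access to the NCBI's suite of interconnected databases (publication, sequence, structure, gene, variation, expression, and so on) from a UNIX terminal window. Functions take their search terms from command-line arguments. Individual operations are combined to build multi-step queries, and record retrieval and formatting normally complete the process.

EDirect also includes an argument-driven function that simplifies extracting data from document summaries or other results returned in structured XML format. This can eliminate the need to write custom software to answer ad hoc questions. Queries can move seamlessly between EDirect commands and UNIX utilities or scripts to perform actions that cannot be accomplished entirely within Entrez.

EDirect official website

Test platform

Mac mini, 2.6 GHz Intel Core i5, macOS Mojave 10.14.2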

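Installation

Change to your working directory:

cd work

Create a src directory:

mkdir src

Change into the src directory:

cd src

Download EDirect:

curl ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/edirect.zip -O

Unzip the archive:

unzip edirect.zip

Change into the edirect directory and list its contents:

cd edirect
ls

You will see setup.sh in the listing; that is the installer. Run it:

./setup.sh

Wait a moment (long enough for a cup of tea) until you see output like this: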

Trying to establish local installations of any missing Perl modules
(as logged in /Users/adu/work/src/edirect/setup-deps.log).
Please be patient, as this step may take a little while.
Entrez Direct has been successfully downloaded and installed.
In order to complete the configuration process, please execute the following:
  echo "source ~/.bash_profile" >> $HOME/.bashrc
  echo "export PATH=\${PATH}:/Users/adu/work/src/edirect" >> $HOME/.bash_profile
or manually edit the PATH variable assignment in your .bash_profile file.

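The installation succeeded.

Configuring the PATH environment variable

Paste the echo "export PATH=\${PATH}:/Users/adu/work/src/edirect" >> $HOME/.bash_profile command printed above into the terminal and press Enter. This tells the shell where EDirect is installed. The installation path differs from machine to machine, so copy the line printed in your own terminal rather than the one shown here.

echo "export PATH=\${PATH}:/Users/adu/work/src/edirect" >> $HOME/.bash_profile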

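Open a new terminal (or run source ~/.bash_profile in the current one) so that the change takes effect, then check the installation:

esearch -help

If you see -bash: esearch: command not found, the PATH is not configured correctly.
If you see esearch 11.1, EDirect is ready to use.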
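
The PATH check can also be scripted. The following is a minimal sketch rather than part of EDirect: path_add and have_cmd are hypothetical helper names, and EDIRECT_DIR is an assumed location that should be replaced with the path printed by your own setup.sh run.

```shell
# Append a directory to PATH for the current session, unless it is already listed.
path_add() {
  case ":$PATH:" in
    *":$1:"*) ;;               # already present, nothing to do
    *) PATH="$PATH:$1" ;;
  esac
}

# Report (via exit status) whether a command can be found on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Assumed install location; use the path from your own installation.
EDIRECT_DIR="$HOME/work/src/edirect"
path_add "$EDIRECT_DIR"

if have_cmd esearch; then
  echo "esearch found at: $(command -v esearch)"
else
  echo "esearch not found; check the PATH entry for $EDIRECT_DIR"
fi
```

Because path_add skips directories that are already listed, running it repeatedly does not keep lengthening PATH, unlike re-running the echo ... >> .bash_profile line.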

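Entrez Direct functions

Navigation functions support exploration within the Entrez databases:

esearch performs a new Entrez search using terms in indexed fields.
elink looks up neighbors (within a database) or links (between databases).
efilter filters or restricts the results of a previous query.

Records can be retrieved in a specified format or as document summaries:

efetch downloads records or reports in a designated format.

Desired fields can be extracted from XML results without writing a program:

xtract converts EDirect XML output into a table of data values.

Several additional functions are also provided:

einfo obtains information on the indexed fields in an Entrez database.
epost uploads unique identifiers (UIDs) or sequence accession numbers.
nquire sends a URL request to a web page or CGI service.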

esearch syntax

esearch -db databaseName -query queryString

databaseName (required)
The name of the database to search.
queryString (required)
The terms to search for in that database.

Example

Search the pubmed database for "opsin gene conversion":

esearch -db pubmed -query "opsin gene conversion"

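####### Building multi-step queries
EDirect lets you run each operation on its own, or combine operations into a multi-step query with the UNIX pipe symbol, the vertical bar ("|"). Piping esearch to elink:

esearch -db pubmed -query "opsin gene conversion" | elink -related

looks up articles related to the initial results (precomputed PubMed neighbors).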

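####### Writing commands on multiple lines
A query can be continued on the next line by typing the backslash ("\") UNIX escape character immediately before pressing Return. Continuing the query to link to all protein sequences published in the related articles:
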
esearch -db pubmed -query "opsin gene conversion" | \
  elink -related | \
  elink -target protein

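A trailing vertical bar (pipe symbol) also lets the query continue on the next line, so the backslash is optional there.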

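####### Retrieving PubMed reports
Pipe the PubMed query results to efetch and specify the "abstract" format:

esearch -db pubmed -query "lycopene cyclase" | efetch -format abstract

Using "medline" as the efetch format instead produces a report that can be imported into common bibliographic management software packages.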

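####### Retrieving sequence reports
Nucleotide and protein records can be downloaded in FASTA format:

esearch -db protein -query "lycopene cyclase" | efetch -format fasta

Other FASTA format variants are fasta_cds_na, fasta_cds_aa, and gene_fasta.
Sequence records can also be obtained as GenBank (-format gb) or GenPept (-format gp) flat files, which carry features annotating particular regions of the sequences.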

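Searching and filtering

####### Restricting query results
The current results can be refined by further term searching in Entrez (useful in the protein database for limiting BLAST neighbors to a taxonomic subset):
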
esearch -db pubmed -query "opsin gene conversion" |
  elink -related |
  efilter -query "tetrachromacy"

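Results can also be filtered by date. For example, the following statements:

efilter -days 60 -datetype PDAT
efilter -mindate 1990 -maxdate 1999 -datetype PDAT

restrict results to articles published in the previous two months or in the 1990s, respectively.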
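
If the same kind of date range is needed repeatedly, the arguments can be generated rather than typed. This is only a sketch: decade_filter_args is a hypothetical helper, and it assumes a decade is given by its first year.

```shell
# Hypothetical helper: print the efilter date arguments that cover one decade,
# given the first year of that decade.
decade_filter_args() {
  start=$1
  end=$((start + 9))
  printf '%s\n' "-mindate $start -maxdate $end -datetype PDAT"
}

decade_filter_args 1990   # prints: -mindate 1990 -maxdate 1999 -datetype PDAT
```

In a pipeline the helper would be expanded unquoted, for example efilter $(decade_filter_args 1990), so that the shell splits its output into separate arguments.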

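####### Qualifying queries by indexed field

Query terms in esearch or efilter can be qualified by entering an indexed field abbreviation in brackets. Boolean operators and parentheses can also be used in the query expression for more complex searches.

Commonly used fields for PubMed queries include:

Abbreviation	Field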
[AFFL]	Affiliation
[ALL]	All Fields
[AUTH]	Author
[FAUT]	Author - First
[LAUT]	Author - Last
[PDAT]	Date - Publication
[FILT]	Filter
[JOUR]	Journal
[LANG]	Language
[MAJR]	MeSH Major Topic
[SUBH]	MeSH Subheading
[MESH]	MeSH Terms
[PTYP]	Publication Type
[WORD]	Text Word
[TITL]	Title
[TIAB]	Title/Abstract
[UID]	UID

A qualified query looks like:

"Tager HS [AUTH] AND glucagon [TIAB]"

Running it gives:

Mac-mini:work adu$ esearch -db pubmed -query "Tager HS [AUTH] AND glucagon [TIAB]"
<ENTREZ_DIRECT>
  <Db>pubmed</Db>
  <WebEnv>NCID_1_123525658_130.14.22.76_9001_1555050683_279099027_0MetA0_S_MegaStore</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>24</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
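
The Count element of this message is usually the first thing to check. As a minimal sketch, assuming the message has exactly the shape shown above, the count can be pulled out with standard sed; the variable names are illustrative only.

```shell
# Message copied from the esearch output above.
message='<ENTREZ_DIRECT>
  <Db>pubmed</Db>
  <WebEnv>NCID_1_123525658_130.14.22.76_9001_1555050683_279099027_0MetA0_S_MegaStore</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>24</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>'

# Print only the digits between <Count> and </Count>.
count=$(printf '%s\n' "$message" | sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p')
echo "records found: $count"   # prints: records found: 24
```

In practice the message would come straight from a command, for example by piping esearch into the same sed expression.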

Filters that limit search results to subsets of PubMed include:

Term	Field
humans	[MESH]
pharmacokinetics	[MESH]
chemically induced	[SUBH]
all child	[FILT]
english	[FILT]
freetext	[FILT]
has abstract	[FILT]
historical article	[FILT]
randomized controlled trial	[FILT]
clinical trial, phase ii	[PTYP]
review	[PTYP]

The sequence databases are indexed with a different set of search fields, including:

Abbreviation	Field
[ACCN]	Accession
[ALL]	All Fields
[AUTH]	Author
[GPRJ]	BioProject
[ECNO]	EC/RN Number
[FKEY]	Feature key
[FILT]	Filter
[GENE]	Gene Name
[JOUR]	Journal
[KYWD]	Keyword
[MLWT]	Molecular Weight
[ORGN]	Organism
[PACC]	Primary Accession
[PROP]	Properties
[PROT]	Protein Name
[SQID]	SeqID String
[SLEN]	Sequence Length
[SUBS]	Substance Name
[WORD]	Text Word
[TITL]	Title
[UID]	UID

A sample query in the protein database is:

"alcohol dehydrogenase [PROT] NOT (bacteria [ORGN] OR fungi [ORGN])"

Running it gives:

Mac-mini:work adu$ esearch -db protein -query "alcohol dehydrogenase [PROT] NOT (bacteria [ORGN] OR fungi [ORGN])"
<ENTREZ_DIRECT>
  <Db>protein</Db>
  <WebEnv>NCID_1_123764783_130.14.18.97_9001_1555052879_1098401792_0MetA0_S_MegaStore</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>8402</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
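
When the search terms come from shell variables, the qualified query can be assembled before it is passed to -query. The sketch below rebuilds the sample protein query; qualify is a hypothetical helper, not an EDirect command.

```shell
# Hypothetical helper: attach an indexed field abbreviation to a term.
qualify() {
  printf '%s [%s]' "$1" "$2"
}

# Combine qualified terms with Boolean operators and parentheses.
query="$(qualify 'alcohol dehydrogenase' PROT) NOT ($(qualify bacteria ORGN) OR $(qualify fungi ORGN))"
printf '%s\n' "$query"
# prints: alcohol dehydrogenase [PROT] NOT (bacteria [ORGN] OR fungi [ORGN])
```

The result would then be passed as esearch -db protein -query "$query", with the double quotes keeping the whole expression as a single argument.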

Additional examples of subset filters in the sequence databases are:

Term	Field
mammalia	[ORGN]
mammalia	[ORGN:noexp]
cds	[FKEY]
lacz	[GENE]
beta galactosidase	[PROT]
protein snp	[FILT]
reviewed	[FILT]
country united kingdom glasgow	[TEXT]
biomol genomic	[PROP]
dbxref flybase	[PROP]
gbdiv phg	[PROP]
phylogenetic study	[PROP]
sequence from mitochondrion	[PROP]
src cultivar	[PROP]
srcdb refseq validated	[PROP]
150:200	[SLEN]

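(The calculated molecular weight (MLWT) field is indexed only for proteins (and structures), not for nucleotides.)

####### Examining intermediate results

EDirect stores intermediate results on the Entrez history server. The EDirect navigation functions produce a custom XML message containing the relevant fields (database, web environment, query key, and record count), which can be read by the next command in the pipeline.

The result of each step in a query can be examined to confirm the expected behavior before the next step is added. The Count field in the ENTREZ_DIRECT object contains the number of records returned by the previous step. A reasonable (non-zero) count value is a good indicator that the query succeeded. For example: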

esearch -db protein -query "NP_567004 [ACCN]" |
  elink -related |
  efilter -query "28000:30000 [MLWT]" |
  elink -target structure |
  efilter -query "0:2 [RESO]"

Output from one run, in which the first elink step timed out:

Mac-mini:work adu$ esearch -db protein -query "NP_567004 [ACCN]" |   elink -related |   efilter -query "28000:30000 [MLWT]" |   elink -target structure |   efilter -query "0:2 [RESO]"
Retrying elink, step 2: callMLink: Timeout waiting for response from MegaLink server (3)
Retrying elink, step 2: callMLink: Timeout waiting for response from MegaLink server (3)
Retrying elink, step 2: callMLink: Timeout waiting for response from MegaLink server (3)
ERROR in link output: callMLink: Timeout waiting for response from MegaLink server (3)
WebEnv: NCID_1_123813944_130.14.22.76_9001_1555053764_1622274016_0MetA0_S_MegaStore
URL: dbfrom=protein&db=protein&query_key=1&WebEnv=NCID_1_123813944_130.14.22.76_9001_1555053764_1622274016_0MetA0_S_MegaStore&cmd=neighbor_history&linkname=protein_protein
Result: <?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD elink 20101123//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20101123/elink.dtd">
<eLinkResult>
<LinkSet>
  <DbFrom>protein</DbFrom>
  <IdList>
    <Id>18410104</Id>
  </IdList>
  <LinkSetDbHistory>
    <DbTo>protein</DbTo>
    <LinkName>protein_protein</LinkName>
  <ERROR>callMLink: Timeout waiting for response from MegaLink server (3)</ERROR>
  </LinkSetDbHistory>
  <WebEnv>NCID_1_123813944_130.14.22.76_9001_1555053764_1622274016_0MetA0_S_MegaStore</WebEnv>
</LinkSet>
</eLinkResult>
ERROR in filt input: callMLink: Timeout waiting for response from MegaLink server (3)
ERROR in link input: callMLink: Timeout waiting for response from MegaLink server (3)
<ENTREZ_DIRECT>
  <Error>callMLink: Timeout waiting for response from MegaLink server (3)</Error>
</ENTREZ_DIRECT>
ERROR in filt input: callMLink: Timeout waiting for response from MegaLink server (3)
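
A failure like this one can be caught in a script by inspecting the final message before going further. The sketch below is an assumption-laden illustration: step_ok is a hypothetical helper, and it relies only on the Error and Count elements visible in the messages shown in this article.

```shell
# Read an ENTREZ_DIRECT message on stdin. Succeed only if it has no <Error>
# element and carries a <Count> other than 0.
step_ok() {
  msg=$(cat)
  case "$msg" in
    *"<Error>"* | *"<Count>0</Count>"*) return 1 ;;
    *"<Count>"*) return 0 ;;
    *) return 1 ;;                     # no count at all: treat as a failure
  esac
}

# Final message from the failed run above.
failed='<ENTREZ_DIRECT>
  <Error>callMLink: Timeout waiting for response from MegaLink server (3)</Error>
</ENTREZ_DIRECT>'

if printf '%s\n' "$failed" | step_ok; then
  echo "step succeeded"
else
  echo "step failed or returned no records"   # this branch runs for the message above
fi
```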

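When the pipeline completes, the final message reports 39 protein structures that fall within the specified molecular weight range and have the desired (X-ray crystallographic) atomic position resolution. The run captured above did not get that far: elink timed out waiting for the MegaLink server, so the final message contains an Error element instead of a Count.

(In a successful run the QueryKey value is 7 rather than 5, because each elink command obtains its record count by running a separate ESearch query immediately after the ELink operation.)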
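
The query key arithmetic can be worked through in a few lines. This sketch assumes the behavior described above (esearch and efilter each use one query key, elink uses two); final_query_key is a hypothetical helper and does not contact NCBI.

```shell
# Compute the QueryKey after a sequence of pipeline steps, assuming that
# elink consumes two keys and every other navigation step consumes one.
final_query_key() {
  key=0
  for step in "$@"; do
    case "$step" in
      elink) key=$((key + 2)) ;;
      *)     key=$((key + 1)) ;;
    esac
  done
  echo "$key"
}

final_query_key esearch elink efilter elink efilter   # prints 7
final_query_key esearch elink                         # prints 3
final_query_key esearch elink esearch elink           # prints 6
```

The first call matches the five-step pipeline above; the other two give the keys of the first and second elink results in a pipeline of the form esearch, elink, esearch, elink.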

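####### Combining independent queries

Independent esearch, elink, and efilter operations can be performed and then combined at the end, using the history server's "#" convention to indicate query key numbers. (The steps to be combined must be in the same database.) Subsequent esearch commands can take a -db argument to override the database piped in from the previous step. (Piping the queries together is necessary for them to share the same history thread.) For example, the query: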

esearch -db protein -query "amyloid* [PROT]" |
  elink -target pubmed |
  esearch -db gene -query "apo* [GENE]" |
  elink -target pubmed |
  esearch -query "(#3) AND (#6)" |
  efetch -format docsum |
  xtract -pattern DocumentSummary -element Id Title

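uses truncation searching (entering the beginning of a word followed by an asterisk) to return the titles of papers linked to both amyloid protein sequence records and apolipoprotein gene records: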

Mac-mini:work adu$ esearch -db protein -query "amyloid* [PROT]" |
>   elink -target pubmed |
>   esearch -db gene -query "apo* [GENE]" |
>   elink -target pubmed |
>   esearch -query "(#3) AND (#6)" |
>   efetch -format docsum |
>   xtract -pattern DocumentSummary -element Id Title
20301340  Alzheimer Disease Overview
28987665  Altered spontaneous brain activity pattern in cognitively normal young adults carrying mutations of APP, presenilin-1/2 and APOE ε4.
28252024  Evolution of complexity in the zebrafish synapse proteome.
28071753  De novo assembly, annotation, and characterization of the whole brain transcriptome of male and female Syrian hamsters.
27626380  High-throughput discovery of novel developmental phenotypes.
27535807  Sex-specific characterization and evaluation of the Alzheimer's disease genetic risk factor sorl1 in zebrafish during aging and in the adult brain following a 100 ppb embryonic lead exposure.
27234028  Alzheimer's disease risk genes in wild-type adult zebrafish exhibit gender-specific expression changes during aging.
27189481  Gene evolution and gene expression after whole genome duplication in fish: the PhyloFish database.
26871637  Widespread Expansion of Protein Interaction Capabilities by Alternative Splicing.
26614614  Multi-tissue transcriptome profiles for coho salmon (Oncorhynchus kisutch), a species undergoing rediploidization following whole-genome duplication.
26469318  RFX transcription factors are essential for hearing in mice.
26319212  Transcriptome sequencing and development of an expression microarray platform for liver infection in adenovirus type 5-infected Syrian golden hamsters.
26107351  Annotation of the Protein Coding Regions of the Equine Genome.
25852190  Integrative analysis of kinase networks in TRAIL-induced apoptosis provides a source of potential targets for combination therapy.
25765076  Tissue-specific transcriptome assemblies of the marine medaka Oryzias melastigma and comparative analysis with the freshwater medaka Oryzias latipes.
25319552  A new rhesus macaque assembly and annotation for next-generation sequencing analyses.
24952961  A high-resolution spatiotemporal atlas of gene expression of the developing mouse brain.
24709693  Genome-wide data reveal novel genes for methotrexate response in a large cohort of juvenile idiopathic arthritis cases.
24705354  The palmitoyl acyltransferase HIP14 shares a high proportion of interactors with huntingtin: implications for a role in the pathogenesis of Huntington's disease.
24658140  The mammalian-membrane two-hybrid assay (MaMTH) for probing membrane-protein interactions in human cells.
24402279  Elephant shark genome provides unique insights into gnathostome evolution.
23962925  Genome analysis reveals insights into physiology and longevity of the Brandt's bat Myotis brandtii.
23258410  Comparative analysis of bat genomes provides insight into the evolution of flight and immunity.
23236062  Sequencing, annotation, and characterization of the influenza ferret infectome.
23149746  Genome sequences of wild and domestic bactrian camels.
23127152  Efficient assembly and annotation of the transcriptome of catfish by RNA-Seq analysis of a doubled haploid homozygote.
20301414  Early-Onset Familial Alzheimer Disease
22751099  The yak genome and adaptation to life at high altitude.
22134011  A preliminary sketch of horn cancer transcriptome in Indian zebu cattle.
22002653  Genome sequencing and comparison of two nonhuman primate animal models, the cynomolgus and Chinese rhesus macaques.
21993625  Genome sequencing reveals insights into physiology and longevity of the naked mole rat.
21857902  Transcriptome sequencing of the blind subterranean mole rat, Spalax galili: utility and potential for the discovery of novel evolutionary patterns.
21484476  Major chimpanzee-specific structural changes in sperm development-associated genes.
21048031  Single nucleotide polymorphisms of matrix metalloproteinase 9 (MMP9) and tumor protein 73 (TP73) interact with Epstein-Barr virus in chronic lymphocytic leukemia: results from the European case-control study EpiLymph.
20875843  Cloning, sequencing and expression in the dog of the main amyloid precursor protein isoforms and some of the enzymes related with their processing.
20433749  Salmo salar and Esox lucius full-length cDNA sequences reveal changes in evolutionary pressures on a post-tetraploidization genome.
20403183  Transcriptome sequencing and development of an expression microarray platform for the domestic ferret.
20237496  New genetic associations detected in a host response study to hepatitis B vaccine.
19946888  Defining the membrane proteome of NK cells.
19820115  BeetleBase in 2010: revisions to provide comprehensive genomic information for Tribolium castaneum.
19393038  A whole-genome assembly of the domestic cow, Bos taurus.
19199708  Proteomic analysis of human parotid gland exosomes by multidimensional protein identification technology (MudPIT).
18362917  The genome of the model beetle and pest Tribolium castaneum.
17145712  PEDE (Pig EST Data Explorer) has been expanded into Pig Expression Data Explorer, including 10 147 porcine full-length cDNA sequences.
16710414  The DNA sequence and biological annotation of human chromosome 1.
16335952  Human plasma N-glycoproteome analysis by immunoaffinity subtraction, hydrazide chemistry, and mass spectrometry.
16141072  The transcriptional landscape of the mammalian genome.
16136131  Initial sequence of the chimpanzee genome and comparison with the human genome.
16109975  The zebrafish gene map defines ancestral vertebrate chromosomes.
15642098  Full-length cDNAs from chicken bursal lymphocytes to facilitate gene function analysis.
15489334  The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC).
15202933  Insertion of the amyloid precursor protein into lipid monolayers: effects of cholesterol and apolipoprotein E.
15169875  Identification and verification of novel rodent postsynaptic density proteins.
15164054  The DNA sequence and comparative analysis of human chromosome 10.
15164053  DNA sequence and analysis of human chromosome 9.
15057824  The DNA sequence and biology of human chromosome 19.
14702039  Complete sequencing and characterization of 21,243 full-length human cDNAs.
14681463  PEDE (Pig EST Data Explorer): construction of a database for ESTs derived from porcine full-length cDNA libraries.
12665801  Exploring proteomes and analyzing protein processing by mass spectrometric identification of sorted N-terminal peptides.
12537572  Annotation of the Drosophila melanogaster euchromatic genome: a systematic review.
12520002  BayGenomics: a resource of insertional mutations in mouse embryonic stem cells.
12477932  Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences.
12454917  Genetic and genomic tools for Xenopus research: The NIH Xenopus initiative.
12040188  A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome.
11247302  Disabled-2 colocalizes with the LDLR in clathrin-coated pits and interacts with AP-2.
11230166  Toward a catalog of human genes and proteins: sequencing and analysis of 500 novel complete protein coding human cDNAs.
11181995  The sequence of the human genome.
10731132  The genome sequence of Drosophila melanogaster.
8858135 Regulation of tumor necrosis factor-and Fas-mediated apoptotic cell death by a novel cDNA TR2L.
7542371 Expression in mouse embryos and in adult mouse brain of three members of the amyloid precursor protein family, of the alpha-2-macroglobulin receptor/low density lipoprotein receptor-related protein and of its ligands apolipoprotein E, lipoprotein lipase, alpha-2-macroglobulin and the 40,000 molecular weight receptor-associated protein.
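
Output like this can be handed straight to ordinary UNIX text tools. The sketch below works on three rows copied from the listing above and assumes the Id and Title columns are separated by a tab, as they are in xtract's tabular output; the variable name rows is illustrative.

```shell
# Three rows copied from the output above, with a tab between Id and Title.
rows=$(printf '%s\t%s\n' \
  20301340 'Alzheimer Disease Overview' \
  19393038 'A whole-genome assembly of the domestic cow, Bos taurus.' \
  11181995 'The sequence of the human genome.')

# Keep only the first (PMID) column.
printf '%s\n' "$rows" | cut -f 1

# Keep only rows whose title mentions "genome", ignoring case.
printf '%s\n' "$rows" | grep -i 'genome'
```

In a real session the same cut or grep stage would simply be appended to the pipeline after xtract.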

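The use of (#3) AND (#6) above, rather than the (#2) AND (#4) one might expect, reflects the fact that each elink command executes a separate ESearch query to obtain its record count, and that query increments the QueryKey. The -label argument can be used to get around this artifact. The label value is prefixed by a "#" symbol and placed in parentheses in the final search. Thus: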

esearch -db structure -query "insulin [TITL]" |
  elink -target pubmed -label struc_cit |
  esearch -db protein -query "insulin [PROT]" |
  elink -target pubmed -label prot_cit |
  esearch -query "(#struc_cit) AND (#prot_cit)" |
  efetch -format uid

returns:

Mac-mini:work adu$ esearch -db structure -query "insulin [TITL]" |
>   elink -target pubmed -label struc_cit |
>   esearch -db protein -query "insulin [PROT]" |
>   elink -target pubmed -label prot_cit |
>   esearch -query "(#struc_cit) AND (#prot_cit)" |
>   efetch -format uid
25423173
15299880
9235985
9141131
8421693
1433291
1772633
2025410
2905485

with no need to keep track of the internal QueryKey values.